Summary

This report documents the steps to prepare genotype data for imputation using the TOPMed Imputation Server:

  • SNP-level quality control using PLINK
  • Conversion to VCF format
  • (Optional) Liftover to GRCh38
  • Sorting, compressing, and indexing VCF files
  • Preparing files for upload

Required Tools and Files

Software Tools

Tool      Purpose                                 Install Command
PLINK     Quality control, format conversion      conda install -c bioconda plink
vcf-sort  Sort VCF files (part of vcftools)       conda install -c bioconda vcftools (or use bcftools sort)
bgzip     Compress VCF files to .vcf.gz           conda install -c bioconda htslib
tabix     Index compressed VCFs                   conda install -c bioconda htslib
bcftools  Annotate VCFs with rsIDs (optional)     conda install -c bioconda bcftools
Perl      Required to run HRC-1000G-check-bim.pl  Pre-installed on most systems

Reference Files

File                                Description                                Command
HRC-1000G-check-bim.pl              Script to harmonize SNP positions/alleles
HRC.r1-1.GRCh37.wgs.mac5.sites.tab  Reference SNP list for QC

Preparing pre-Imputation data

Additional SNP-Level Quality Control

plinkFile <- "ADNI1_QC_FINAL"
dataDir <- getwd()
setwd(dataDir)

# Read BIM file
bim <- read.table(paste0(plinkFile, ".bim"), header = FALSE, stringsAsFactors = FALSE)
colnames(bim) <- c("CHR", "SNP", "CM", "BP", "A1", "A2")

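# Filter chromosomes 1–22 and X (PLINK writes X as 23 unless --output-chr is set)
bim <- bim[bim$CHR %in% c(as.character(1:22), "X", "23"), ]

# Filter alleles to A/C/G/T only
valid_alleles <- c("A", "C", "T", "G")
bim <- bim[bim$A1 %in% valid_alleles & bim$A2 %in% valid_alleles, ]

# Remove duplicated positions (same chromosome AND same base-pair position;
# comparing BP alone would also drop unrelated SNPs on different chromosomes)
pos_key <- paste(bim$CHR, bim$BP, sep = ":")
dup_pos <- pos_key[duplicated(pos_key)]
bim <- bim[!pos_key %in% dup_pos, ]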

# Save valid SNPs
write.table(bim$SNP, "ValidSNPs.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)
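
As a quick sanity check on the filtering above, the sketch below lists any chromosome:position pair that still occurs more than once in a .bim file. The function name dup_positions is a hypothetical helper, not part of PLINK; it only assumes the standard six-column .bim layout.

```shell
# Print every chromosome:position key that occurs more than once in a
# PLINK .bim file (columns: CHR, SNP, CM, BP, A1, A2).
# Each duplicated key is printed once, when its second occurrence is seen.
dup_positions() {
  awk '{ key = $1 ":" $4; if (++seen[key] == 2) print key }' "$1"
}

# Hypothetical usage (no output means no duplicated positions remain):
# dup_positions ADNI1_QC_FINAL.bim
```

Running it on the fileset written in the next step shows whether the filter was actually applied.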

Convert to Binary Format

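# Keep only the SNPs saved in ValidSNPs.txt and write the binary fileset (X coded as "X")
system(paste("plink --file", plinkFile, "--extract ValidSNPs.txt --output-chr M --make-bed --out", plinkFile))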

Allele Frequency Calculation

system(paste("plink --bfile", plinkFile, "--freq --out", plinkFile))

BIM Check

hrc_script <- "/path/to/HRC-1000G-check-bim.pl"
hrc_ref <- "/path/to/HRC.r1-1.GRCh37.wgs.mac5.sites.tab"

system(paste("perl", hrc_script,
             "-h -r", hrc_ref,
             "-b", paste0(plinkFile, ".bim"),
             "-f", paste0(plinkFile, ".frq"),
             "-c -p EUR -o"))

Run SNP Update Script

system("chmod 755 Run-plink.sh")
system("./Run-plink.sh")

Save Reference Alleles

for (i in 1:22) {
  bim_chr <- read.table(paste0(plinkFile, "-updated-chr", i, ".bim"), header = FALSE)
  write.table(bim_chr[, c(2, 6)], paste0("snps_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
}

Convert to Sorted VCF and Index

for i in {1..22}; do
  # Sort by position, compress with bgzip, then index with tabix
  vcf-sort "ADNI1_QC_FINAL-updated-chr$i.vcf" | bgzip -c > "ADNI1-updated-chr$i.vcf.gz"
  tabix -p vcf "ADNI1-updated-chr$i.vcf.gz"
  echo "Processed chr$i"
done
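
Imputation servers reject unsorted input, so it can be worth confirming the result of the loop above before uploading. The following is a minimal sketch, assuming one chromosome per file; vcf_is_sorted is a hypothetical helper name and it only looks at the CHROM and POS columns.

```shell
# Read an uncompressed VCF on stdin. Exit 0 if positions never decrease
# within a chromosome; otherwise report the first offending record and exit 1.
# It does not detect a chromosome that reappears later in the file.
vcf_is_sorted() {
  awk -F '\t' '
    /^#/ { next }                      # skip header lines
    $1 == chr && $2 + 0 < pos {        # same chromosome, position went backwards
      printf "unsorted at %s:%s\n", $1, $2
      bad = 1
      exit
    }
    { chr = $1; pos = $2 + 0 }         # remember the previous record
    END { exit bad + 0 }
  '
}

# Hypothetical usage:
# bgzip -dc ADNI1-updated-chr1.vcf.gz | vcf_is_sorted && echo "chr1 OK"
```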

Output pre-Imputation

Each chromosome will produce:

  • ADNI1-updated-chr<i>.vcf.gz
  • ADNI1-updated-chr<i>.vcf.gz.tbi

These files can now be uploaded to the TOPMed or Michigan Imputation Server.
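
Before uploading, a small check that all 22 compressed VCFs and their indexes exist and are non-empty can save a failed job. This is a sketch under the naming used in this report; missing_upload_files is a hypothetical helper, and the prefix and directory are passed as arguments.

```shell
# List the expected per-chromosome upload files that are missing or empty.
# $1 = file prefix (this report uses "ADNI1-updated-chr"); $2 = directory (default: .)
# The body runs in a subshell so its variables do not leak into the caller.
missing_upload_files() (
  prefix=$1
  dir=${2:-.}
  i=1
  while [ "$i" -le 22 ]; do
    for f in "$dir/$prefix$i.vcf.gz" "$dir/$prefix$i.vcf.gz.tbi"; do
      [ -s "$f" ] || echo "$f"
    done
    i=$((i + 1))
  done
)

# Hypothetical usage (no output means the upload set is complete):
# missing_upload_files ADNI1-updated-chr .
```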


Running Imputation


Preparing imputed data

Decompress results

# Replace XXXXXXX with the password provided by the Imputation Server
for i in *.zip
do
  unzip -P XXXXXXX "$i"
done

Merge chromosome datasets in one file

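rm -f merge.list   # start from an empty list so that reruns do not append duplicates
for i in {1..22}
do
  # Convert each imputed VCF to PLINK binary format (hard-call genotypes only)
  plink --vcf chr$i.dose.vcf.gz --make-bed --double-id --out chr$i.final
  echo chr$i.final >> merge.list
done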

plink --merge-list merge.list --make-bed --out ADNI1.merged
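
One way to sanity-check the merge is to compare the number of variants in the merged .bim with the sum over the per-chromosome .bim files. The sketch below assumes the file names used above; check_merge_counts is a hypothetical helper. A mismatch is a prompt to read the PLINK log (for example, variants sharing an ID are combined during merging), not necessarily an error.

```shell
# Compare variant counts: $1 = merged .bim, remaining arguments = per-chromosome .bim files.
# Prints both counts and returns non-zero when they differ.
check_merge_counts() (
  merged=$1
  shift
  n_merged=$(( $(wc -l < "$merged") ))
  n_parts=$(( $(cat "$@" | wc -l) ))
  echo "merged=$n_merged parts=$n_parts"
  [ "$n_merged" -eq "$n_parts" ]
)

# Hypothetical usage:
# check_merge_counts ADNI1.merged.bim chr*.final.bim
```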

Annotate to rsID format (optional)

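# Export the merged data to a bgzipped VCF; vcf-iid writes only the IID as the sample ID
plink --bfile ADNI1.merged --recode vcf-iid bgz --out ADNI1.impQC
tabix -p vcf ADNI1.impQC.vcf.gz

# Copy rsIDs from dbSNP into the ID column. The dbSNP build must match the build of
# the imputed data (this file is GRCh37; TOPMed output is GRCh38).
bcftools annotate \
  --annotations /nfs/users2/rg/nvilortejedor/ALFA-GWAS/HRC_Imputation/annotation_GRCh37p13/All_20180423.vcf.gz \
  --columns ID --threads 20 -O z \
  -o ADNI1.impQC.rs.vcf.gz ADNI1.impQC.vcf.gz

# Assumption: convert the annotated VCF back to PLINK binary format to obtain the
# ADNI1.impQC.rs.bed/.bim/.fam files listed under "Output After Imputation"
plink --vcf ADNI1.impQC.rs.vcf.gz --make-bed --double-id --out ADNI1.impQC.rs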

Additional post-Imputation QC

This step is left to the user. We normally check imputation quality, minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE), …
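
As an illustration of the imputation-quality check, the sketch below keeps only records whose INFO column carries an R2 value at or above a chosen cutoff. It assumes the server's dose VCFs report quality as R2 in the INFO field; filter_r2 is a hypothetical helper and the 0.3 cutoff in the usage line is only a placeholder, not a recommendation. PLINK binary files do not store INFO, so a filter like this has to run on the chr<i>.dose.vcf.gz files before they are converted.

```shell
# Read an uncompressed VCF on stdin and write header lines plus all records
# with INFO/R2 >= $1 to stdout. Records without an R2 entry are dropped.
filter_r2() {
  awk -F '\t' -v min="$1" '
    /^#/ { print; next }               # pass header lines through unchanged
    {
      n = split($8, info, ";")         # INFO is the 8th VCF column
      for (k = 1; k <= n; k++)
        if (info[k] ~ /^R2=/ && substr(info[k], 4) + 0 >= min + 0) { print; next }
    }
  '
}

# Hypothetical usage:
# bgzip -dc chr1.dose.vcf.gz | filter_r2 0.3 | bgzip -c > chr1.dose.r2.vcf.gz
```

With bcftools installed, an expression filter such as bcftools view -i 'INFO/R2>=0.3' should give the same selection.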

Remove intermediate files

rm ADNI1.impQC*.vcf.gz*

Output After Imputation

The merged, annotated dataset consists of a single PLINK fileset:

  • ADNI1.impQC.rs.bed
  • ADNI1.impQC.rs.bim
  • ADNI1.impQC.rs.fam

Acknowledgements

Organized by the Alzheimer’s Association ISTAART Neuroimaging PIA, Brain Imaging Genetics working group.

Special thanks to ADNI for providing the datasets.

© 2025 AAIC Workshop Basics of Genetics • Maintained by @GeneticNeuroStats